1 Introduction

1.1 Background

• Is crime generally rising in Chicago in the past decade (last 10 years)?

• Is there a seasonal component to the crime rate?

• Which time series method seems to capture the variation in your time series better?

• Explain your choice of algorithm and its key assumptions.

Students should be awarded the full 3 points if they address at least 2 of the above questions. The questions are by no means definitive, but can be used as a guide in preparing the project. The data contains a variety of offenses, but you can sample only one type of crime you are interested in (e.g., theft, narcotics, battery). Use visualization if it helps support your narrative.

Tasks

  1. Perform preprocessing steps to create a time series object.

  2. Demonstrate the analysis process of trend and seasonality using simple plotting tools.

  3. Build a forecasting model and explain your choice of algorithm and its key assumptions.

1.2 Data Source

The dataset consists of 7,100,712 observations and 22 variables, published by the City of Chicago.

https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2

1.2.1 Overview of Variables

id: Unique identifier for the record.

case_number: The Chicago Police Department Records Division Number, which is unique to the incident.

date: Date when the incident occurred.

block: Partially redacted address where the incident occurred.

iucr: Illinois Uniform Crime Reporting code (directly linked to primary_type and description).

primary_type: The primary description of the IUCR code.

description: The secondary description of the IUCR code, a subcategory of the primary description.

location_description: Description of the location where the incident occurred.

arrest: Indicates whether an arrest was made.

domestic: Indicates whether the incident was domestic-related as defined by the Illinois Domestic Violence Act.

beat: Indicates the police beat where the incident occurred.

district: Indicates the police district where the incident occurred.

ward: The ward (City Council district) where the incident occurred.

community_area: Indicates the community area where the incident occurred.

fbi_code: Indicates the National Incident-Based Reporting System (NIBRS) crime classification. More details can be found in

x_coordinate: X coordinate of the incident location (State Plane Illinois East NAD 1983 projection).

y_coordinate: Y coordinate of the incident location (State Plane Illinois East NAD 1983 projection).

year: Year the incident occurred.

updated_on: Date and time the record was last updated.

latitude: The latitude of the location where the incident occurred.

longitude: The longitude of the location where the incident occurred.

location: Concatenation of latitude and longitude.

1.3 Libraries and Setup

To run the data preparation and the statistical analysis, the following libraries are loaded:

## Warning: package 'Hmisc' was built under R version 3.6.3
## Warning: package 'forecast' was built under R version 3.6.3
## Warning: package 'prophet' was built under R version 3.6.3
## Warning: package 'rlang' was built under R version 3.6.3

2 Preprocessing

2.2 Checking missing values

## Primary.Type
##        n  missing distinct
##  3362963        0       34
##
## lowest : ARSON, ASSAULT, BATTERY, BURGLARY, CONCEALED CARRY LICENSE VIOLATION
## highest: ROBBERY, SEX OFFENSE, STALKING, THEFT, WEAPONS VIOLATION

2.3 Processing Date

## 'data.frame':    3362963 obs. of  26 variables:
##  $ ID                  : int  11778791 11738821 11030960 11046665 10426681 11047604 10145496 10409118 10900579 11010305 ...
##  $ Case.Number         : Factor w/ 3362691 levels "",".JB299184",..: 3247179 3223349 2709684 2720621 2325500 2721133 2159298 2315592 2620714 2695582 ...
##  $ Date                : Factor w/ 1343026 levels "01/01/2009 01:00:00 AM",..: 438 438 438 438 438 438 438 438 438 438 ...
##  $ Block               : Factor w/ 35824 levels "0000X E 100TH PL",..: 27695 28281 4538 16636 23504 15294 11880 31755 1302 32829 ...
##  $ IUCR                : Factor w/ 379 levels "0110","0141",..: 232 203 9 202 9 130 231 202 9 231 ...
##  $ Primary.Type        : Factor w/ 34 levels "ARSON","ASSAULT",..: 24 31 6 31 6 10 24 31 6 24 ...
##  $ Description         : Factor w/ 419 levels "$500 AND UNDER",..: 342 152 322 4 322 183 3 4 322 3 ...
##  $ Location.Description: Factor w/ 191 levels "","ABANDONED BUILDING",..: 140 140 140 140 140 140 19 19 140 140 ...
##  $ Arrest              : Factor w/ 2 levels "false","true": 1 1 2 1 2 1 1 1 1 1 ...
##  $ Domestic            : Factor w/ 2 levels "false","true": 2 1 1 1 2 1 2 1 2 1 ...
##  $ Beat                : int  1621 1611 621 831 932 133 2522 424 513 2223 ...
##  $ District            : int  16 16 6 8 9 1 25 4 5 22 ...
##  $ Ward                : int  39 41 21 17 16 3 31 10 34 21 ...
##  $ Community.Area      : int  12 10 71 66 61 35 20 46 49 73 ...
##  $ FBI.Code            : Factor w/ 26 levels "01A","01B","02",..: 3 20 3 20 3 14 23 20 3 23 ...
##  $ X.Coordinate        : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Y.Coordinate        : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Year                : int  2009 2009 2009 2009 2009 2009 2009 2009 2009 2009 ...
##  $ Updated.On          : Factor w/ 3002 levels "01/01/2016 03:54:40 PM",..: 226 1506 2416 2537 764 1800 1856 322 849 1574 ...
##  $ Latitude            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Longitude           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Location            : Factor w/ 528050 levels "","(36.619446395, -91.686565684)",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Date0               : Factor w/ 1343026 levels "01/01/2009 01:00:00 AM",..: 438 438 438 438 438 438 438 438 438 438 ...
##  $ Date1               : Factor w/ 1343026 levels "01/01/2009 01:00:00 AM",..: 438 438 438 438 438 438 438 438 438 438 ...
##  $ Month               : Factor w/ 12 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ YearMon             : Factor w/ 132 levels "2009-01","2009-02",..: 1 1 1 1 1 1 1 1 1 1 ...
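
The `str()` output above shows `Date` stored as a factor of strings such as "01/01/2009 01:00:00 AM". A minimal sketch of how such strings could be parsed and the `Month` and `YearMon` columns derived, using base R only. The toy data frame `crimes`, the new column `DateTime`, and the choice of `tz = "UTC"` are assumptions for illustration, not the report's original code:

```r
# Toy stand-in for the crime data (hypothetical rows in the source's date format).
crimes <- data.frame(
  Date = factor(c("01/01/2009 01:00:00 AM",
                  "07/15/2012 11:30:00 PM",
                  "12/31/2018 12:05:00 PM"))
)

# Convert factor -> character -> POSIXct; %I is the 12-hour clock, %p is AM/PM.
# tz = "UTC" is an assumption that sidesteps daylight-saving gaps.
# Note: %p parsing depends on the locale (works in C/English locales).
crimes$DateTime <- as.POSIXct(as.character(crimes$Date),
                              format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")

# Derive the helper columns seen in the str() output.
crimes$Month   <- factor(as.integer(format(crimes$DateTime, "%m")), levels = 1:12)
crimes$YearMon <- format(crimes$DateTime, "%Y-%m")

print(crimes)
```

Calling `as.character()` before parsing matters here: converting a factor directly would operate on its integer codes rather than on the date strings.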

2.4 Time Series Data Preparation

We will prepare two datasets: one to analyze and predict overall crime in Chicago, and the other to understand the historical trends of the five most frequent crime types as of 2009 (THEFT, BATTERY, CRIMINAL DAMAGE, NARCOTICS, and BURGLARY) during the period 2009 to 2018.

2.4.1 Raw Files

## 'data.frame':    3104009 obs. of  6 variables:
##  $ ID          : int  11778791 11738821 11030960 11046665 10426681 11047604 10145496 10409118 10900579 11010305 ...
##  $ Date        : Factor w/ 1343026 levels "01/01/2009 01:00:00 AM",..: 438 438 438 438 438 438 438 438 438 438 ...
##  $ Primary.Type: Factor w/ 34 levels "ARSON","ASSAULT",..: 24 31 6 31 6 10 24 31 6 24 ...
##  $ Year        : int  2009 2009 2009 2009 2009 2009 2009 2009 2009 2009 ...
##  $ Month       : Factor w/ 12 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ YearMon     : Factor w/ 132 levels "2009-01","2009-02",..: 1 1 1 1 1 1 1 1 1 1 ...

2.4.2 Files for Exploratory Data Analysis

## 'data.frame':    50 obs. of  3 variables:
##  $ Primary.Type: Factor w/ 34 levels "ARSON","ASSAULT",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Year        : int  2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 ...
##  $ n           : int  68462 65403 60458 59136 54004 49449 48917 50293 49227 49812 ...
## 'data.frame':    600 obs. of  3 variables:
##  $ Primary.Type: Factor w/ 34 levels "ARSON","ASSAULT",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ YearMon     : Factor w/ 132 levels "2009-01","2009-02",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ n           : int  5037 4961 6211 5862 6826 6552 6244 6181 5789 5238 ...
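
The monthly counts above (column `n` per `YearMon`) are the input for a time series object. A minimal base-R sketch of that step, in which simulated dates stand in for the real records and the names `incident_dates` and `crimes_ts` are hypothetical:

```r
set.seed(1)
# Simulated incident dates covering 2009-2018 (stand-in for the real Date column).
all_days <- seq(as.Date("2009-01-01"), as.Date("2018-12-31"), by = "day")
incident_dates <- sample(all_days, size = 50000, replace = TRUE)

# Count incidents per year-month; fixing the factor levels keeps empty months as 0.
month_levels <- format(seq(as.Date("2009-01-01"), as.Date("2018-12-01"), by = "month"),
                       "%Y-%m")
year_mon  <- factor(format(incident_dates, "%Y-%m"), levels = month_levels)
monthly_n <- as.integer(table(year_mon))

# Monthly ts object: 12 observations per year, starting January 2009.
crimes_ts <- ts(monthly_n, start = c(2009, 1), frequency = 12)
print(window(crimes_ts, end = c(2009, 12)))
```

Setting `frequency = 12` is what tells later functions that the seasonal period is one year.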

3 Exploratory Data Analysis

Let us explore the total number of crime cases by broad category, using Primary.Type, for THEFT, BATTERY, CRIMINAL DAMAGE, NARCOTICS, and BURGLARY during the 10-year period 2009 to 2018.

The plot for the top 5 crime types shows varying rates of decline between 2009 and 2015/2016, followed by stabilization. THEFT is the exception: it appears to be on an upward trend from 2015 onwards.

The monthly plot shows a seasonal peak in the top 5 crime types during July and August.
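
One way to check a seasonal peak numerically, rather than by eye, is to average the monthly counts by calendar month. A minimal base-R sketch on simulated data; the series `y` is a hypothetical stand-in constructed to peak in July, so the result illustrates the method and says nothing about the real data:

```r
set.seed(2)
# Simulated monthly counts, 2009-2018, with a summer peak (July = month 7).
month_idx <- rep(1:12, times = 10)
y <- ts(25000 + 4000 * cos(2 * pi * (month_idx - 7) / 12) + rnorm(120, sd = 200),
        start = c(2009, 1), frequency = 12)

# cycle() returns the position within the year (1 = Jan, ..., 12 = Dec).
cycle_idx    <- as.integer(cycle(y))
monthly_mean <- tapply(as.numeric(y), cycle_idx, mean)
names(monthly_mean) <- month.abb

print(round(monthly_mean))
peak_month <- names(which.max(monthly_mean))
print(peak_month)
```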

4 Time Series Analysis of Overall Crimes in Chicago

##     YearMon          n        
##  Min.   :  1   Min.   :16374  
##  1st Qu.: 28   1st Qu.:22820  
##  Median : 55   Median :25242  
##  Mean   : 55   Mean   :26206  
##  3rd Qu.: 82   3rd Qu.:29934  
##  Max.   :109   Max.   :35829

4.1 Exponential Smoothing State Space

## ETS(A,Ad,A) 
## 
## Call:
##  ets(y = log(crimesdf1_ts[, 2])) 
## 
##   Smoothing parameters:
##     alpha = 0.412 
##     beta  = 0.0001 
##     gamma = 0.0001 
##     phi   = 0.9797 
## 
##   Initial states:
##     l = 10.4369 
##     b = -0.0089 
##     s = -0.0856 -0.0443 0.0483 0.0463 0.1156 0.1171
##            0.0759 0.0696 -0.0237 -0.0219 -0.2194 -0.078
## 
##   sigma:  0.0352
## 
##       AIC      AICc       BIC 
## -200.9189 -193.3189 -152.4747 
## 
## Training set error measures:
##                         ME       RMSE        MAE          MPE      MAPE
## Training set -0.0009265643 0.03230743 0.02377682 -0.009995252 0.2350457
##                   MASE       ACF1
## Training set 0.3864285 0.08187146
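
For reference, the label `ETS(A,Ad,A)` denotes additive errors, an additive damped trend, and additive seasonality. In the standard innovations state-space notation (textbook symbols, assumed here rather than taken from the report), with $y_t$ the logged monthly count, $\ell_t$ the level, $b_t$ the slope, $s_t$ the seasonal state, and $m = 12$:

```latex
\begin{aligned}
y_t    &= \ell_{t-1} + \phi\, b_{t-1} + s_{t-m} + \varepsilon_t \\
\ell_t &= \ell_{t-1} + \phi\, b_{t-1} + \alpha\, \varepsilon_t \\
b_t    &= \phi\, b_{t-1} + \beta\, \varepsilon_t \\
s_t    &= s_{t-m} + \gamma\, \varepsilon_t
\end{aligned}
```

Read against the fitted values above, beta = 0.0001 and gamma = 0.0001 mean the slope and the seasonal pattern stay close to their initial states, alpha = 0.412 lets the level adapt to recent errors, and phi = 0.9797 gradually damps the trend.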

4.1.1 Decomposition of Time Series
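
A decomposition of the kind this subsection refers to can be sketched with base R's `stl()`. The series below is simulated, and the use of `stl()` with `s.window = "periodic"` is an assumption, since the report does not show which decomposition function was used:

```r
set.seed(3)
# Simulated logged monthly counts: declining trend + fixed seasonal pattern + noise.
t_idx <- 1:120
log_y <- ts(10.4 - 0.003 * t_idx + 0.1 * sin(2 * pi * t_idx / 12) + rnorm(120, sd = 0.03),
            start = c(2009, 1), frequency = 12)

# STL splits the series into seasonal, trend, and remainder components.
fit_stl    <- stl(log_y, s.window = "periodic")
components <- fit_stl$time.series
print(head(components))

# plot(fit_stl) draws the data and the three components in stacked panels.
```

The three components sum back to the original series, so the remainder panel shows whatever the trend and seasonal terms fail to explain.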

4.1.2 Checking ACF and PACF

ACF (autocorrelation function): the correlation between the observation at the current time point and the observations at earlier time points (lags).

PACF (partial autocorrelation function): the correlation between observations at two time points after removing the influence of the observations in between. For daily data, the lag-2 PACF is the "real" correlation between today and the day before yesterday after taking out the influence of yesterday. At lag 1 there is nothing in between, so the PACF equals the ACF.
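
The difference between the two functions is easiest to see on a series whose structure is known. A minimal base-R sketch on a simulated AR(1) process (the series `x` is hypothetical and unrelated to the crime data): the ACF decays gradually, while the PACF drops to near zero after lag 1.

```r
set.seed(4)
# Simulated AR(1) series: each value depends directly only on the previous one.
x <- arima.sim(model = list(ar = 0.8), n = 500)

# plot = FALSE returns the estimates instead of drawing the correlogram.
acf_est  <- acf(x, lag.max = 5, plot = FALSE)
pacf_est <- pacf(x, lag.max = 5, plot = FALSE)

# acf() includes lag 0 (always 1); pacf() starts at lag 1.
print(round(acf_est$acf[2:4], 2))   # lags 1-3 of the ACF
print(round(pacf_est$acf[1:3], 2))  # lags 1-3 of the PACF
```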

4.1.3 Forecast based on Exponential Smoothing State Space

## 
##  Box-Ljung test
## 
## data:  fc.ets$residuals
## X-squared = 17.411, df = 20, p-value = 0.6262

The correlogram shows that the autocorrelations of the in-sample forecast errors do not exceed the significance bounds for lags 1-20. Furthermore, the p-value from the Ljung-Box test is 0.63, so there is little evidence of non-zero autocorrelation at lags 1-20.
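
A test of this form can be run with base R's `Box.test()`. In the sketch below, simulated white noise stands in for the model residuals (the vector `resid_sim` is hypothetical), so the printed statistic will differ from the one reported above:

```r
set.seed(5)
# Stand-in for model residuals: white noise should show no autocorrelation.
resid_sim <- rnorm(120)

# Ljung-Box test over lags 1-20; H0: all autocorrelations up to lag 20 are zero.
# fitdf = 0 gives df = 20, matching the degrees of freedom reported above.
lb <- Box.test(resid_sim, lag = 20, type = "Ljung-Box", fitdf = 0)
print(lb)

# A large p-value means H0 is not rejected; it does not prove independence.
```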

4.2 ARIMA

## Series: log(crimesdf1_ts[, 2]) 
## ARIMA(0,1,1)(0,1,1)[12] 
## 
## Coefficients:
##           ma1     sma1
##       -0.5803  -0.7699
## s.e.   0.0977   0.1394
## 
## sigma^2 estimated as 0.001432:  log likelihood=174.1
## AIC=-342.19   AICc=-341.93   BIC=-334.5
## 
## Training set error measures:
##                       ME       RMSE        MAE        MPE      MAPE      MASE
## Training set 0.002245556 0.03514098 0.02523751 0.02161875 0.2502806 0.4101681
##                   ACF1
## Training set 0.1066384

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,1,1)(0,1,1)[12]
## Q* = 21.116, df = 20, p-value = 0.3903
## 
## Model df: 2.   Total lags used: 22
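
The selected structure, ARIMA(0,1,1)(0,1,1)[12], applies one regular and one seasonal difference, each paired with a single moving-average term. A minimal sketch of fitting the same structure with base R's `arima()` on simulated data; the report itself appears to use the forecast package, so this is a stand-in rather than the original call, and the series `log_y` is hypothetical:

```r
set.seed(6)
t_idx <- 1:120
# Simulated logged counts: drifting level + seasonal pattern + noise.
level <- 10.4 + cumsum(rnorm(120, mean = -0.003, sd = 0.01))
log_y <- ts(level + 0.1 * sin(2 * pi * t_idx / 12) + rnorm(120, sd = 0.02),
            start = c(2009, 1), frequency = 12)

# order = (p, d, q); seasonal order = (P, D, Q) with period 12.
fit <- arima(log_y, order = c(0, 1, 1),
             seasonal = list(order = c(0, 1, 1), period = 12))
print(fit)

# Forecast 24 months ahead on the log scale, then back-transform.
# exp() of a log-scale forecast estimates the median count, not the mean.
fc <- predict(fit, n.ahead = 24)
fc_counts <- exp(fc$pred)
print(round(head(fc_counts)))
```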

5 Summary

• Is crime generally rising in Chicago in the past decade (last 10 years)?

No. Overall crime declined from 2009 to 2016 and has roughly stabilised since. Specific crime types can differ: THEFT shows an increasing trend after 2015.

• Is there a seasonal component to the crime rate?

Yes. Both the plots of the top 5 crime types and the fitted ETS and ARIMA models, each of which includes a seasonal term, indicate a seasonal component to the crime rate.

• Which time series method seems to capture the variation in your time series better?

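It is difficult to decide, as the ETS and ARIMA models capture the variation in the time series about equally well. On the training set, ETS is marginally better (RMSE 0.0323 vs. 0.0351; MASE 0.386 vs. 0.410), and the residuals of both models pass the Ljung-Box test (p = 0.63 for ETS, p = 0.39 for ARIMA). The AIC values are not comparable across the two model classes, because the ARIMA likelihood is computed on the differenced series.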

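• Explain your choice of algorithm and its key assumptions

Comparing ETS and ARIMA, the ARIMA model is my preferred choice for its flexibility: the p, d, q orders and their seasonal counterparts can be adjusted directly, and auto.arima can select them automatically. Its key assumptions are that the logged series becomes stationary after one regular and one seasonal difference, and that the residuals are uncorrelated with constant variance. The Ljung-Box result (p = 0.39) is consistent with uncorrelated residuals.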
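
Training-set measures alone tend to favour the more flexible model, so one way to settle such a comparison is a hold-out test. The sketch below assumes the forecast package is installed (it is loaded in Section 1.3) and uses simulated data with a hypothetical split at the end of 2016. It illustrates the procedure and is not the analysis behind the figures in this report:

```r
library(forecast)

set.seed(7)
t_idx <- 1:120
# Simulated logged counts: drifting level + seasonal pattern + noise.
level <- 10.4 + cumsum(rnorm(120, mean = -0.003, sd = 0.01))
log_y <- ts(level + 0.1 * sin(2 * pi * t_idx / 12) + rnorm(120, sd = 0.02),
            start = c(2009, 1), frequency = 12)

# Fit on 2009-2016, evaluate on 2017-2018.
train <- window(log_y, end = c(2016, 12))
test  <- window(log_y, start = c(2017, 1))

fit_ets   <- ets(train)
fit_arima <- auto.arima(train)

h <- length(test)
acc_ets   <- accuracy(forecast(fit_ets, h = h), test)
acc_arima <- accuracy(forecast(fit_arima, h = h), test)

# Compare out-of-sample error; lower is better.
holdout <- rbind(ETS   = acc_ets["Test set", c("RMSE", "MAE", "MAPE")],
                 ARIMA = acc_arima["Test set", c("RMSE", "MAE", "MAPE")])
print(holdout)
```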

6 Appendix: Predict Overall Crimes in Chicago Using Prophet

## Disabling daily seasonality. Run prophet with daily.seasonality=TRUE to override this.

6.1 Checking for missing values

## crimesProphet 
## 
##  4  Variables      3652  Observations
## --------------------------------------------------------------------------------
## Date 
##          n    missing   distinct       Info       Mean        Gmd        .05 
##       3652          0       3652          1 2013-12-31       1218 2009-07-02 
##        .10        .25        .50        .75        .90        .95 
## 2010-01-01 2011-07-02 2013-12-31 2016-07-01 2017-12-30 2018-07-01 
## 
## lowest : 2009-01-01 2009-01-02 2009-01-03 2009-01-04 2009-01-05
## highest: 2018-12-27 2018-12-28 2018-12-29 2018-12-30 2018-12-31
## --------------------------------------------------------------------------------
## n 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     3652        0      708        1    849.9    183.6      621      663 
##      .25      .50      .75      .90      .95 
##      731      822      962     1087     1138 
## 
## lowest :  320  367  394  404  426, highest: 1404 1438 1523 1544 1833
## --------------------------------------------------------------------------------
## ds 
##          n    missing   distinct       Info       Mean        Gmd        .05 
##       3652          0       3652          1 2013-12-31       1218 2009-07-02 
##        .10        .25        .50        .75        .90        .95 
## 2010-01-01 2011-07-02 2013-12-31 2016-07-01 2017-12-30 2018-07-01 
## 
## lowest : 2009-01-01 2009-01-02 2009-01-03 2009-01-04 2009-01-05
## highest: 2018-12-27 2018-12-28 2018-12-29 2018-12-30 2018-12-31
## --------------------------------------------------------------------------------
## y 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     3652        0      708        1    849.9    183.6      621      663 
##      .25      .50      .75      .90      .95 
##      731      822      962     1087     1138 
## 
## lowest :  320  367  394  404  426, highest: 1404 1438 1523 1544 1833
## --------------------------------------------------------------------------------

6.2 Basic Predictions

##        ds                          trend        additive_terms      
##  Min.   :2009-01-01 00:00:00   Min.   : 727.0   Min.   :-163.69511  
##  1st Qu.:2012-01-01 06:00:00   1st Qu.: 732.6   1st Qu.: -55.91230  
##  Median :2014-12-31 12:00:00   Median : 736.8   Median :  11.17632  
##  Mean   :2014-12-31 12:00:00   Mean   : 830.2   Mean   :   0.02585  
##  3rd Qu.:2017-12-30 18:00:00   3rd Qu.: 945.5   3rd Qu.:  57.07482  
##  Max.   :2020-12-30 00:00:00   Max.   :1111.2   Max.   : 137.82315  
##  additive_terms_lower additive_terms_upper     weekly         weekly_lower    
##  Min.   :-163.69511   Min.   :-163.69511   Min.   :-37.518   Min.   :-37.518  
##  1st Qu.: -55.91230   1st Qu.: -55.91230   1st Qu.:-10.853   1st Qu.:-10.853  
##  Median :  11.17632   Median :  11.17632   Median : -1.508   Median : -1.508  
##  Mean   :   0.02585   Mean   :   0.02585   Mean   :  0.000   Mean   :  0.000  
##  3rd Qu.:  57.07482   3rd Qu.:  57.07482   3rd Qu.:  4.519   3rd Qu.:  4.519  
##  Max.   : 137.82315   Max.   : 137.82315   Max.   : 46.297   Max.   : 46.297  
##   weekly_upper         yearly            yearly_lower       
##  Min.   :-37.518   Min.   :-126.19285   Min.   :-126.19285  
##  1st Qu.:-10.853   1st Qu.: -54.10178   1st Qu.: -54.10178  
##  Median : -1.508   Median :  13.41738   Median :  13.41738  
##  Mean   :  0.000   Mean   :   0.02585   Mean   :   0.02585  
##  3rd Qu.:  4.519   3rd Qu.:  57.70804   3rd Qu.:  57.70804  
##  Max.   : 46.297   Max.   :  91.53803   Max.   :  91.53803  
##   yearly_upper        multiplicative_terms multiplicative_terms_lower
##  Min.   :-126.19285   Min.   :0            Min.   :0                 
##  1st Qu.: -54.10178   1st Qu.:0            1st Qu.:0                 
##  Median :  13.41738   Median :0            Median :0                 
##  Mean   :   0.02585   Mean   :0            Mean   :0                 
##  3rd Qu.:  57.70804   3rd Qu.:0            3rd Qu.:0                 
##  Max.   :  91.53803   Max.   :0            Max.   :0                 
##  multiplicative_terms_upper   yhat_lower       yhat_upper      trend_lower    
##  Min.   :0                  Min.   : 467.7   Min.   : 662.2   Min.   : 700.0  
##  1st Qu.:0                  1st Qu.: 627.5   1st Qu.: 822.4   1st Qu.: 732.1  
##  Median :0                  Median : 708.4   Median : 902.7   Median : 736.8  
##  Mean   :0                  Mean   : 733.6   Mean   : 927.2   Mean   : 828.3  
##  3rd Qu.:0                  3rd Qu.: 836.4   3rd Qu.:1027.1   3rd Qu.: 945.5  
##  Max.   :0                  Max.   :1115.2   Max.   :1311.1   Max.   :1111.2  
##   trend_upper          yhat       
##  Min.   : 727.0   Min.   : 567.5  
##  1st Qu.: 734.2   1st Qu.: 725.2  
##  Median : 752.7   Median : 805.6  
##  Mean   : 832.2   Mean   : 830.3  
##  3rd Qu.: 945.5   3rd Qu.: 932.0  
##  Max.   :1111.2   Max.   :1209.8
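
The summary above has the shape of what `predict()` returns for a fitted Prophet model. A minimal sketch of that workflow, assuming the prophet package is installed and using simulated daily counts. The 730-day horizon is inferred from the history ending on 2018-12-31 and the forecast ending on 2020-12-30; the data frame `df` and everything else here is illustrative:

```r
library(prophet)

set.seed(8)
# Prophet expects a data frame with columns `ds` (date) and `y` (value).
ds  <- seq(as.Date("2016-01-01"), as.Date("2018-12-31"), by = "day")
doy <- as.numeric(format(ds, "%j"))
df  <- data.frame(
  ds = ds,
  y  = 850 + 80 * cos(2 * pi * (doy - 200) / 365.25) + rnorm(length(ds), sd = 40)
)

m      <- prophet(df)                              # seasonalities are auto-detected
future <- make_future_dataframe(m, periods = 730)  # history plus 730 future days
fc     <- predict(m, future)

print(tail(fc[, c("ds", "yhat", "yhat_lower", "yhat_upper")]))
# plot(m, fc) and prophet_plot_components(m, fc) draw the forecast and its components.
```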

6.3 Visualization

## Warning: Aspect ratios aren't yet implemented, but you can manually set a
## suitable height/width

The graph shows the yearly trend and seasonality more clearly, and how they are used to make predictions.

6.3.1 Forecast Components

The chart shows a downward annual trend in overall crime from 2009 to 2015/2016, followed by a plateau until 2018. In terms of seasonality, overall crime peaks in July and August each year. Within the week, crime counts rise toward a spike on Friday and taper off over the weekend.